[PyTorch debug] Fix issue with tp_group=None#2733

Open
pggPL wants to merge 1 commit into NVIDIA:main from pggPL:fix_debug_hang

Conversation

pggPL (Collaborator) commented Mar 4, 2026

Description

If no TP is used, then tp_size=1 and tp_group=None inside TE modules.
In the debug feature utility get_reduction_params(), when TEDebugState.weight_tensor_tp_group_reduce=True, we previously set reduction_group = tp_group unconditionally. With tp_group=None, downstream distributed collectives interpret this as the default/world group, which can cause unintended cross-rank reduction for weight stats.

This PR fixes that behavior by making TP-group reduction explicit and safe:

  • if weight_tensor_tp_group_reduce=True but tp_group is None, skip reduction for weight stats;
  • if weight_tensor_tp_group_reduce=True and tp_group is available, reduce in that TP group;
  • existing behavior for other paths remains unchanged.

This prevents accidental world-group reductions when TP is effectively disabled (tp_size=1 / tp_group=None) while preserving the intended semantics of "reduce only within TP group" for weight tensors.
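The guarded logic described above can be sketched as follows. This is a hypothetical, simplified reconstruction based on the PR description and the flowchart below, not the exact TransformerEngine source; the function name matches the utility being fixed, but the parameter list and return tuple are assumptions.

```python
# Hypothetical sketch of the fixed get_reduction_params() logic.
# Assumption: the function returns (skip_reduction, reduction_group,
# reduce_within_microbatch), matching the API contract described in the PR.
def get_reduction_params(tensor_name, tp_group, weight_tensor_tp_group_reduce=True):
    skip_reduction = False
    reduction_group = None  # None means "default/world group" to torch.distributed
    reduce_within_microbatch = True

    if tensor_name == "weight":
        reduce_within_microbatch = False
        if weight_tensor_tp_group_reduce and tp_group is not None:
            # Reduce only within the explicit TP group.
            reduction_group = tp_group
        else:
            # tp_group=None (tp_size=1): skip reduction entirely rather than
            # falling through to the default world group.
            skip_reduction = True

    return skip_reduction, reduction_group, reduce_within_microbatch
```

With `tp_group=None`, the weight path now returns `skip_reduction=True` instead of handing `None` to a collective.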

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
pggPL (Collaborator, Author) commented Mar 4, 2026

/te-ci pytorch L1

greptile-apps bot (Contributor) commented Mar 4, 2026

Greptile Summary

This PR fixes a subtle bug in get_reduction_params() within the debug utilities where passing tp_group=None (the case when TP is disabled, i.e. tp_size=1) would inadvertently assign None to reduction_group. Since group=None in PyTorch distributed collectives resolves to the default world process group, this caused unintended cross-rank reduction of weight tensor statistics across all ranks rather than restricting it to the intended TP group.

Key changes:

  • In transformer_engine/debug/features/utils/__init__.py, the get_reduction_params() function now guards the reduction_group = tp_group assignment behind a tp_group is not None check (line 26).
  • When weight_tensor_tp_group_reduce=True but tp_group is None, skip_reduction is set to True (line 30), preventing the accidental world-group reduction entirely.
  • A clear inline comment (lines 24-25) explains why None must not be assigned to reduction_group.
  • All callers (log_tensor_stats.py, log_fp8_tensor_stats.py, log_nvfp4_tensor_stats.py) remain unaffected as the API contract (skip_reduction, reduction_group, reduce_within_microbatch) is unchanged.
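The unchanged API contract matters because the callers branch on `skip_reduction` before ever touching `reduction_group`. A hypothetical caller-side sketch (with `all_reduce_stub` standing in for `torch.distributed.all_reduce`; the helper names here are illustrative, not from the TE source):

```python
# Hypothetical consumer of the (skip_reduction, reduction_group, ...) contract.
def maybe_reduce(stats, skip_reduction, reduction_group, all_reduce):
    if skip_reduction:
        return stats  # no collective is issued at all
    # Only reached with an explicit group; passing group=None to a real
    # collective would select the default (world) process group.
    return all_reduce(stats, group=reduction_group)

calls = []  # record which groups collectives were issued on
def all_reduce_stub(stats, group):
    calls.append(group)
    return stats

# With the fix: tp_group=None yields skip_reduction=True, so no collective runs.
maybe_reduce([1.0], skip_reduction=True, reduction_group=None, all_reduce=all_reduce_stub)
# With a real TP group: the collective runs on exactly that group.
maybe_reduce([1.0], skip_reduction=False, reduction_group="tp_group_handle",
             all_reduce=all_reduce_stub)
```

Because the skip happens before the collective call, no rank ever enters a world-group reduction it shouldn't, which is what previously caused the hang referenced in the branch name (`fix_debug_hang`).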

Confidence Score: 5/5

  • This PR is safe to merge; it is a minimal, targeted bug fix that prevents unintended world-group reductions when TP is disabled.
  • The fix is a single, well-scoped guard check that prevents a concrete bug (world-group reduction when tp_group=None) without altering any existing code paths. The logic is straightforward, inline comments clearly explain the intent, and the change is localized to a utility function with a stable API contract.
  • No files require special attention.

Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["get_reduction_params(tensor_name, tp_group)"] --> B{tensor_name == 'weight'?}
    B -- No --> C["skip_reduction=False<br/>reduction_group=default<br/>reduce_within_microbatch=True"]
    B -- Yes --> D{weight_tensor_tp_group_reduce?}
    D -- False --> E["skip_reduction=True<br/>reduce_within_microbatch=False"]
    D -- True --> F{tp_group is not None?}
    F -- Yes --> G["reduction_group=tp_group<br/>skip_reduction=False<br/>reduce_within_microbatch=False"]
    F -- No --> H["skip_reduction=True ✅ FIXED<br/>reduce_within_microbatch=False<br/>(was: reduction_group=None → world group 🐛)"]
```

Last reviewed commit: c2adcaf
